
Conversation

@yiming0416 (Contributor) commented on Feb 10, 2026

This PR merges the simple_fsdp and compiler_toolkit experiments into a new unified experiment called graph_based_training (name to be discussed later).

The two experiments shared the same DTensor-based SimpleFSDP model authoring but had separate compilation paths: simple_fsdp used JIT compilation (torch.compile) and compiler_toolkit used AOT joint graph capture. The new experiment unifies them under a single compile.mode config field ("jit" or "aot"), with a shared pass registry that validates pass/mode compatibility.
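
To make the dispatch concrete, below is a minimal sketch of the mode-based routing. Only `compile.mode`, `_apply_jit`, and `_apply_aot` are names taken from this PR; `CompileConfig`, `apply_compilation`, and the function bodies are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of the unified compile-mode dispatch. Only compile.mode,
# _apply_jit, and _apply_aot are named in this PR; CompileConfig,
# apply_compilation, and the bodies below are illustrative assumptions.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class CompileConfig:
    mode: str = "jit"  # "jit" -> torch.compile; "aot" -> AOT joint graph capture


def _apply_jit(model: nn.Module, config: CompileConfig) -> nn.Module:
    # JIT path inherited from simple_fsdp: hand the model to torch.compile.
    return torch.compile(model)


def _apply_aot(model: nn.Module, config: CompileConfig) -> nn.Module:
    # AOT path inherited from compiler_toolkit: capture the joint graph
    # ahead of time and apply graph passes (elided in this sketch).
    return model


def apply_compilation(model: nn.Module, config: CompileConfig) -> nn.Module:
    if config.mode == "jit":
        return _apply_jit(model, config)
    if config.mode == "aot":
        return _apply_aot(model, config)
    raise ValueError(f"unsupported compile mode: {config.mode!r}")
```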

This PR only adds files under a new folder; no existing files in simple_fsdp/ or compiler_toolkit/ are modified.

File change breakdown

Files copied without changes:

  • simple_fsdp.py — Copied from simple_fsdp/simple_fsdp.py.
  • reshard_after_forward.py — Copied from simple_fsdp/reshard_after_forward.py.
  • cudagraph.py — Copied from compiler_toolkit/cudagraph.py.

Files copied with import path changes only:

  • common_utils.py — Adapted from compiler_toolkit/common_utils.py.
  • graph_utils.py — Adapted from compiler_toolkit/graph_utils.py.
  • jit_backend.py — Adapted from simple_fsdp/backend.py.
  • train.py — Adapted from compiler_toolkit/train.py.
  • llama3/__init__.py — Adapted from simple_fsdp/llama3/__init__.py.
  • llama3/model.py — Adapted from simple_fsdp/llama3/model.py.
  • deepseek_v3/__init__.py — Adapted from simple_fsdp/deepseek_v3/__init__.py.
  • deepseek_v3/model.py — Adapted from simple_fsdp/deepseek_v3/model.py.

Files adapted with non-trivial changes:

  • passes.py — Unified pass registry that validates pass/mode compatibility (a sketch follows this list).
  • compilation.py — Unified compilation dispatcher routing to _apply_jit() and _apply_aot().
  • job_config.py — Merged job config exposing the compile.mode field ("jit" or "aot").
  • llama3/parallelize.py — Unified parallelize function merging the logic from simple_fsdp and compiler_toolkit.
  • deepseek_v3/parallelize.py — Same as above, but for DeepSeek-V3.
  • tests/integration_tests.py — Merged integration tests from both experiments.
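
As referenced in the passes.py item above, a registry that validates pass/mode compatibility could look roughly like the following sketch. The registry layout, the specific pass names, and the validate_passes helper are assumptions for illustration; only the idea of validating pass/mode compatibility comes from this PR.

```python
# Illustrative pass registry: each entry maps a pass name to the compile
# modes it supports, so incompatible pass/mode pairs are rejected before
# compilation. The pass names and validate_passes helper are hypothetical;
# only the idea of validating pass/mode compatibility comes from this PR.
PASS_REGISTRY: dict[str, set[str]] = {
    "cudagraph": {"aot"},             # hypothetical: only meaningful on the AOT graph path
    "autobucketing": {"jit", "aot"},  # hypothetical: usable under either mode
}


def validate_passes(mode: str, pass_names: list[str]) -> None:
    """Reject unknown passes and pass/mode combinations that are not supported."""
    for name in pass_names:
        supported = PASS_REGISTRY.get(name)
        if supported is None:
            raise ValueError(f"unknown compiler pass: {name!r}")
        if mode not in supported:
            raise ValueError(f"pass {name!r} is not compatible with compile mode {mode!r}")
```

With a registry like this, validate_passes("jit", ["cudagraph"]) would fail fast at config time instead of surfacing a confusing error deep inside compilation.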

meta-cla bot added the CLA Signed label on Feb 10, 2026
yiming0416 force-pushed the yiming/consolidate_experiments branch from 57f47e9 to 00b5640 on February 10, 2026 at 01:53
yiming0416 force-pushed the yiming/consolidate_experiments branch from 00b5640 to 17b18ec on February 10, 2026 at 02:02
yiming0416 changed the title from "[DRAFT] Consolidate simple_fsdp and compiler_toolkit experiments" to "[RFC] Consolidate simple_fsdp and compiler_toolkit experiments" on Feb 10, 2026
yiming0416 marked this pull request as ready for review on February 10, 2026 at 17:45
yiming0416 requested a review from aditvenk on February 10, 2026 at 19:06
@fegin (Contributor) commented on Feb 11, 2026

Can we use git mv to preserve the history when a file is copied from the original folder? We can also deprecate the two experiments in this PR. With git mv it is clearer what is going on with this change.

@tianyu-l (Contributor) left a comment


Sorry, I'm doing a massive refactoring of the torchtitan config system. I'm almost done (with the first version) and I'd prefer we rebase after I'm done... (pray)


Labels: ciflow/8gpu, CLA Signed
